F/G  5/10 


AD-A107 135  MISSOURI UNIV-COLUMBIA  TAILORED TESTING RESEARCH LAB
PROCEDURES FOR CRITERION REFERENCED TAILORED TESTING. (U)

AUG 81  M D RECKASE  N00014-77-C-0097

UNCLASSIFIED 




Procedures  for  Criterion  Referenced 
Tailored  Testing 


Mark D. Reckase


University of Missouri
Columbia, MO 65211




Prepared under contract No. N00014-77-C-0097, NR150-395
with the Personnel and Training Research Programs
Psychological Sciences Division

Approved for public release; distribution unlimited.
Reproduction in whole or in part is permitted for
any purpose of the United States Government.




SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)


REPORT DOCUMENTATION PAGE
READ INSTRUCTIONS BEFORE COMPLETING FORM

1. REPORT NUMBER

2. GOVT ACCESSION NO.
AD-A107 135

3. RECIPIENT'S CATALOG NUMBER

4. TITLE (and Subtitle)
Final Report: Procedures for criterion referenced tailored testing

5. TYPE OF REPORT & PERIOD COVERED
Final Report, 1977-1981

6. PERFORMING ORG. REPORT NUMBER

7. AUTHOR(s)
Mark D. Reckase

8. CONTRACT OR GRANT NUMBER(s)
N00014-77-C-0097

9. PERFORMING ORGANIZATION NAME AND ADDRESS
Department of Educational Psychology
University of Missouri
Columbia, MO 65211

10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS
P.E.: 61153N  Proj.: RR042-04
T.A.: 042-04-01
W.U.: NR150-395

11. CONTROLLING OFFICE NAME AND ADDRESS
Personnel and Training Research Programs
Office of Naval Research

12. REPORT DATE
August 1981

13. NUMBER OF PAGES
11

14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office)

15. SECURITY CLASS. (of this report)
Unclassified

15a. DECLASSIFICATION/DOWNGRADING SCHEDULE

16. DISTRIBUTION STATEMENT (of this Report)
Approved for public release; distribution unlimited. Reproduction in whole
or in part is permitted for any purpose of the United States Government.

17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report)

18. SUPPLEMENTARY NOTES

19. KEY WORDS (Continue on reverse side if necessary and identify by block number)
Testing                   Item Calibration
Achievement Testing       Linking
Latent Trait Models       Sequential Probability Ratio Test
Tailored Testing          Factor Analysis

20. ABSTRACT (Continue on reverse side if necessary and identify by block number)

This report summarizes the research findings of a four-year contract
investigating the applicability of item response theory and tailored testing
to criterion-referenced measurement. Six major areas were studied on the
project. These included: (a) techniques for forming unidimensional item
sets, (b) techniques for calibrating items, (c) item parameter linking
procedures, (d) comparisons of latent trait models, (e) tailored testing
procedures, and (f) decision making procedures. The results showed that
factor analytic procedures were best at forming unidimensional item pools,
the LOGIST calibration program performed slightly better than the ANCILLES
program for item calibration, the maximum likelihood procedure using the
LOGIST program generally gave the best linking, the three-parameter logistic
model was preferred to the one-parameter model for tailored testing
applications, the maximum likelihood based tailored testing procedure was
slightly preferred to Owen's Bayesian based procedure, and the use of the
sequential probability ratio test with tailored testing resulted in
substantial savings in test length. Overall, tailored testing was shown to be
feasible for achievement testing applications. More detailed results are
described in the papers and reports listed in this report.




CONTENTS 


Introduction

Formation of Unidimensional Item Sets

Item Calibration

Item Parameter Linking

Latent Trait Model

Tailored Testing Procedure

Decision Making Procedure

Summary and Conclusions

References


FINAL  REPORT: 

PROCEDURES  FOR  CRITERION-REFERENCED  TAILORED  TESTING 


The  purpose  of  this  contract  has  been  to  investigate  the  applicability 
of  item  response  theory  (IRT)  and  tailored  testing  to  criterion-referenced 
measurement.  Since  criterion-referenced  measurement  involves  creating  an 
item  domain,  setting  criterion  cutoffs,  and  making  a  decision  as  to  the 
location  of  an  examinee  relative  to  the  cutoff,  the  applicability  of  item 
response  theory  and  tailored  testing  had  to  be  evaluated  for  each  of  these 
components. 

The investigation of the first component, creating an item domain,
involved evaluating procedures for forming unidimensional item sets,
procedures for item calibration, and procedures for linking calibrations
together to form large item pools. This was done since IRT requires an
assumption of unidimensionality, and large item pools are required for
tailored testing.

The setting of criterion cutoffs and the decision making aspect of
criterion-referenced testing required the investigation of the IRT/tailored
testing approach to achievement testing and to decision making. Therefore,
the various IRT and tailored testing models were evaluated for achievement
testing applications and a decision making procedure was developed for
tailored testing applications. Each of these components will now be
described in detail and research results will be summarized.

Formation  of  Unidimensional  Item  Sets 


Because of the assumption that the items used in an IRT based procedure
can be described in a unidimensional latent space, it was considered an
important component of this project to evaluate procedures for selecting test
items to meet this assumption. Therefore, a study was planned to evaluate
the ability of various procedures to determine the dimensionality of a set
of test items and to sort items into sets measuring a single dimension. The
procedures evaluated included factor analysis, nonmetric multidimensional
scaling, cluster analysis, and latent trait theory analysis. These
procedures were applied to simulated and real test data of varying factorial
complexity. In all cases, guessing was present in the data since multiple
choice items were assumed for the tailored testing application.

The  results  of  this  study  are  reported  in: 

Reckase, M. D. The formation of homogeneous item sets when guessing is a
factor in item response (Research Report 81-5). Columbia, MO: University
of Missouri, August 1981.
and  in: 

Reckase,  M.  D.  Guessing  and  dimensionality:  The  search  for  a  unidimensional 
latent  space.  Paper  presented  at  the  meeting  of  the  American  Educational 
Research  Association,  Los  Angeles,  April  1981. 

Reckase,  M.  D.  The  effect  of  guessing  in  dichotomously  scored  items  on  the 
operation  of  multivariate  data  reduction  techniques.  Paper  presented  at 
the  meeting  of  the  Psychometric  Society,  Iowa  City,  IA,  May  1980. 

The  results  indicated  that  factor  analytic  and  nonmetric  multidimensional 
scaling  techniques  could  be  used  to  sort  items  into  unidimensional  sets  and 
that  cluster  analysis  and  latent  trait  analysis  were  generally  not  appropriate. 
Of the factor analytic techniques evaluated, principal factor analysis of
phi coefficients was found to give the best information for determining the
dimensionality of a set of items. Nonmetric multidimensional scaling was
found to work well with some similarity coefficients while giving fairly
meaningless results with others. The MDSCAL program (Kruskal, 1964) used
gave the best results with Yule's Y coefficient, the phi coefficient, and
tetrachoric correlations.
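
As a point of reference, the phi coefficient mentioned above is the Pearson correlation between two dichotomously scored (0/1) items. The following minimal sketch shows how a phi coefficient matrix could be computed from a response matrix before it is passed to a factor analysis routine. The function names and the convention of returning 0.0 for an item with no variance are assumptions made for illustration; they are not taken from the programs used on the project.

```python
import math


def phi_coefficient(x, y):
    """Phi coefficient between two equal-length vectors of 0/1 item scores."""
    if len(x) != len(y) or not x:
        raise ValueError("score vectors must be non-empty and of equal length")
    # Cell counts of the 2 x 2 table of joint item scores.
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # Phi is undefined when an item has no variance; returning 0.0 is a
        # convention chosen for this sketch only.
        return 0.0
    return (n11 * n00 - n10 * n01) / denom


def phi_matrix(responses):
    """Inter-item phi matrix; `responses` holds one row of 0/1 scores per examinee."""
    n_items = len(responses[0])
    columns = [[row[j] for row in responses] for j in range(n_items)]
    return [
        [1.0 if i == j else phi_coefficient(columns[i], columns[j])
         for j in range(n_items)]
        for i in range(n_items)
    ]
```

The resulting matrix is symmetric with ones on the diagonal, which is the form a principal factor analysis program would normally expect as input.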

The cluster analysis procedure was found to be inappropriate because of
difficulty in determining how many clusters should be present in the data.
Both hierarchical and complete link procedures were evaluated. The latent
trait procedures worked fairly well, but were too cumbersome for general use.
The procedure involved successive applications of LOGIST (Wood, Wingersky,
& Lord, 1976) to a set of test items, deleting the items with low a-values
after each run. Unidimensional subsets were usually formed in this way,
but as many as ten program runs were required for each item set. This was
clearly impractical, especially considering the cost of the program runs.

Item  Calibration 

Once it had been determined that a set of items met the assumption of
a unidimensional latent space, the items needed to be calibrated according
to one of the IRT models. That is, the parameters of the model needed to
be estimated using one of the available computer programs for that purpose.
There are several IRT models that could be used for tailored testing
applications and each of these models has several calibration programs for
use in estimating its parameters. One of the initial tasks of this contract
was to compare the various models available and to evaluate their
calibration programs.

The results of a review of the literature and a comparison of item
calibration models are presented in:

Reckase, M. D. Ability estimation and item calibration using the one and
three parameter logistic models: A comparative study (Research Report
77-1). Columbia, MO: University of Missouri, November 1977.
and

McKinley, R. L. & Reckase, M. D. A comparison of the ANCILLES and LOGIST
parameter estimation procedures for the three-parameter logistic model
using goodness of fit as a criterion (Research Report 80-2). Columbia,
MO: University of Missouri, December 1980.

Other  papers  related  to  this  topic  are: 

Reckase, M. D. Unifactor latent trait models applied to multifactor tests:
Results and implications. Journal of Educational Statistics, 1979,
4(3), 207-230.

McKinley, R. L. & Reckase, M. D. The fit of ICC's based on two different
three-parameter logistic model parameter estimation procedures. Paper
presented at the meeting of the Psychometric Society, Chapel Hill, N.C.,
May 1981.

Reckase,  M.  D.  The  validity  of  latent  trait  models  through  the  analysis  of 
fit  and  invariance.  Paper  presented  at  the  meeting  of  the  American 
Educational  Research  Association,  Los  Angeles,  April  1981. 

Reckase, M. D. A comparison of the one- and three-parameter logistic models
for item calibration. Paper presented at the meeting of the American
Educational Research Association, Toronto, March 1978.

Reckase, M. D. Univariate latent trait models applied to multivariate measures.
Paper presented at the meeting of the Psychometric Society, Chapel Hill,
N.C., May 1977.

The results of the research on calibration techniques can be divided into
two parts: the comparison of calibration models, and the comparison of
calibration programs for a single model. The comparison of calibration models
concentrated on the one- and three-parameter logistic models. The results of
the research on the calibration models showed that the three-parameter model
fit empirical data better than the one-parameter model, but that the sample
size required to use the three-parameter model was substantially larger than
for the one-parameter model. Lack of fit was most prominent for the
one-parameter model when guessing was a factor in item response.
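
The role of guessing in this comparison can be seen from the usual form of the two item characteristic curves. The sketch below uses the standard textbook parameterization, with discrimination a, difficulty b, and lower asymptote c. The scaling constant D = 1.7 and the function names are assumptions of this illustration and are not taken from the calibration programs evaluated on the project.

```python
import math


def icc_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic probability of a correct response.

    a = discrimination, b = difficulty, c = lower asymptote (guessing level).
    D is a scaling constant; 1.7 is a common convention, assumed here.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def icc_1pl(theta, b, D=1.0):
    """One-parameter logistic model: unit discrimination and no guessing."""
    return icc_3pl(theta, a=1.0, b=b, c=0.0, D=D)
```

Because the one-parameter curve falls to zero for examinees of very low ability while the three-parameter curve levels off at c, the one-parameter model has no way to represent correct answers obtained by guessing on multiple choice items.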

The models were also found to yield ability estimates that measured
different constellations of ability when items tapping several dimensions
were used in a test. The one-parameter logistic based ability estimates were
related to the sum of the components present in a test, while the
three-parameter logistic based ability estimates were found to be mainly
related to the single largest component in a test. This difference in ability
estimates is due to differences in the weighting of the item responses for the
two models. Unit weights are used for the one-parameter model, while the
items are weighted by the item discrimination parameter estimates for the
three-parameter model. Despite these differences, the ability estimates
obtained from the models were found to be highly correlated for many tests
composed of a fixed set of items. The controlling factor seemed to be the
magnitude of the first principal component of the test. When the item
calibration results from multiple choice tests were to be used for tailored
testing purposes, the three-parameter logistic model was found to be superior
because of the better fit to empirical data. The ability to approximate
guessing effects was found to be especially important because of the error
induced in the one-parameter item parameter estimates by this factor.

Among the numerous available item calibration procedures, the ANCILLES
(Urry, 1978) and LOGIST (Wood, Wingersky & Lord, 1976) procedures were selected
for comparison on this project because of their wide usage by the testing
community. The results of the research showed that the ICC estimates from
the LOGIST program fit the empirical item data slightly better than those from
the ANCILLES program. For this reason, the LOGIST program was suggested for
item calibration for use with tailored testing procedures.

In  addition  to  comparing  existing  calibration  models  and  programs,  a  new 
estimation  procedure  was  developed  on  the  project  by  Robert  Tsutakawa  and 
Steve  Rigdon.  This  new  procedure  is  described  in  detail  in  the  report: 

Rigdon, S. E. & Tsutakawa, R. K. Estimation in latent trait models (Research
Report 80-1). Columbia, MO: University of Missouri, May 1981.

The investigation focused on the estimation of ability and item parameters
for a class of binary response models, including the commonly used logistic
and probit models. Estimation procedures were examined for variations of these
models depending on whether only the ability parameters or both ability and
item parameters are assumed random with prior distributions with fixed but
unknown hyperparameters.


When the item parameters are fixed and the ability parameters are
random, the EM algorithm can be readily adapted. Estimates of ability
parameters are easily found, even in situations where maximum likelihood
estimates do not exist. Simulation studies have shown that these
estimates are more efficient than maximum likelihood estimates in terms of the
mean square error criterion. The EM algorithm can be modified to
estimate the item parameters by conditioning on the expected ability
parameters. This revised method is computationally much cheaper while
performing as well as the straight EM algorithm. Although most of the numerical
work has been restricted to the one-parameter logistic (Rasch) model,
with a normal prior, the method extends to multi-parameter models.
Computer programs for the two-parameter (for ability and guessing) logistic
model are now being developed.

When both item and ability parameters are considered random, the EM
algorithm applies in principle but cannot be easily implemented since the
random variables do not have distributions belonging to exponential families.
The algorithm was modified by alternately estimating the ability and item
parameters while holding one set fixed at its posterior expectation.
Simulated results assuming normal priors indicated that the resulting estimators
do not perform as well as the maximum likelihood estimator. This discrepancy
disappeared, however, when the prior distribution of the difficulty parameter
was assumed uniform.


The extent to which the models used here are applicable in practice
remains to be seen. Some preliminary work was done on goodness of fit
tests. Though it appears that the classical chi-square methods apply when
ability parameters are from a common prior distribution, the amount of
computation needed for even a moderate number of items may be prohibitive.
A more feasible approach may be through the logarithmic penalty function
mentioned by Mosteller and Wallace (1964) in their book on the Federalist
Papers and examined in more detail by Efron (1978).

Two other areas included in the original research objectives were
comparing item response curves and designing sequential methods for mental
testing. Work was limited since the procedures depend on the results from
the EM algorithm study.

Item  Parameter  Linking 

Once item parameter estimates had been obtained from a series of group
tests, they needed to be placed on the same scale (linked) so all of the
items could be used in the tailored testing item pool. Numerous procedures
have been developed for this linking task. A natural extension of the
research on models and calibration procedures was to evaluate linking
procedures to determine which gave the most accurate parameter estimates for
use with tailored testing. The results of the evaluation were reported in
the following report:

McKinley,  R.  L.  &  Reckase,  M.  D.  A  comparison  of  procedures  for  constructing 
large  item  pools  (Research  Report  81-3).  Columbia,  MO:  University  of 
Missouri,  August  1981. 

Some  of  the  results  of  this  research  were  also  reported  in: 

Reckase, M. D. Item pool construction for use with latent trait models. Paper
presented at the meeting of the American Educational Research Association,
San Francisco, April 1979.

The basic design for this part of the research effort was to sample a
series of short tests from a long test, link the calibrations of the short
tests, and then compare the linked parameter estimates to those obtained
from the long test. This procedure was used to evaluate linking procedures
for both the one-parameter and three-parameter models and to determine the
necessary sample size and number of common items between tests for linking
to be performed. The MAX calibration program (Wright & Panchapakesan, 1969)
was used for the one-parameter model and the ANCILLES and LOGIST programs
were used for the three-parameter model.

The results of the research showed that far fewer cases were required
to link the single parameter of the one-parameter model than were needed
for the parameters of the three-parameter model. Approximately 2000 cases
seemed to be needed in the latter case. Of the four linking procedures
evaluated for the three-parameter logistic model, maximum likelihood linking
using the LOGIST program gave the best results overall when a sample of 2000
cases was used. Fifteen items in common between test forms were sufficient
for adequate linking. Future research should address the quality of items
that should be in common between tests.
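
To make the idea of placing two calibrations on a common scale concrete, the sketch below uses a simple mean and sigma transformation based on the difficulty estimates of the common items. This is a different and much simpler technique than the maximum likelihood linking with LOGIST that performed best in the study; it is shown only as an illustration, and the function names and example values are hypothetical.

```python
import statistics


def mean_sigma_constants(b_source, b_target):
    """Slope A and intercept B that map the source scale onto the target scale.

    b_source and b_target hold difficulty estimates of the same common items
    from the two calibrations, in the same order.
    """
    if len(b_source) != len(b_target) or len(b_source) < 2:
        raise ValueError("need at least two common items, paired in order")
    A = statistics.pstdev(b_target) / statistics.pstdev(b_source)
    B = statistics.mean(b_target) - A * statistics.mean(b_source)
    return A, B


def rescale_item(a, b, c, A, B):
    """Express three-parameter estimates (a, b, c) on the target scale."""
    # Difficulty moves with the ability scale, discrimination is divided by
    # the slope, and the lower asymptote does not depend on the scale.
    return a / A, A * b + B, c


# Hypothetical difficulty estimates for three common items.
A, B = mean_sigma_constants([-1.0, 0.0, 1.0], [-1.5, 0.5, 2.5])
linked = rescale_item(1.0, 0.0, 0.2, A, B)
```

With only a few common items the slope and intercept are themselves poorly estimated, which is consistent with the attention paid above to the number of common items and the sample size.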

Latent  Trait  Model 


Once  an  item  pool  has  been  produced  using  the  calibration  and  linking 
methods  described  above,  the  actual  tailored  testing  process  can  begin.  To 
define  that  process,  one  of  the  many  latent  trait  models  must  be  selected 
as  a  basis  for  the  tailored  testing  procedure.  Of  the  many  models  available, 
the  one-  and  three-parameter  logistic  models  were  evaluated  for  use  in  this 
project. A series of three tailored testing studies was run to determine
which of these two models gave the most accurate ability estimates in a
realistic testing setting. The following reports describe these studies in
detail.

Koch, W. R. & Reckase, M. D. A live tailored testing comparison study of
the one- and three-parameter logistic models (Research Report 78-1).
Columbia, MO: University of Missouri, June 1978.

Koch, W. R. & Reckase, M. D. Problems in application of latent trait models
to tailored testing (Research Report 79-1). Columbia, MO: University
of Missouri, September 1979.

McKinley, R. L. & Reckase, M. D. A successful application of latent trait
theory to tailored achievement testing (Research Report 80-1). Columbia,
MO: University of Missouri, February 1980.

Other  papers  written  on  this  topic  were: 

McKinley, R. L. & Reckase, M. D. Computer application to ability testing.
AEDS Journal, 1980, 13(3), 193-203.

Reckase, M. D. Procedures for computerized testing. Behavior Research
Methods and Instrumentation, 1977, 9(2), 148-152.

English, R. A., Reckase, M. D., & Patience, W. M. Application of tailored
testing to achievement measurement. Behavior Research Methods and
Instrumentation, 1977, 9(2), 158-161.

Reckase,  M.  D.  Tailored  testing,  measurement  problems,  and  latent  trait 
theory.  Paper  presented  at  the  meeting  of  the  National  Council  on 
Measurement  in  Education,  Los  Angeles,  April  1981. 




Patience,  W.  M.  &  Reckase,  M.  D.  Self-paced  versus  paced  evaluation  utilizing 
computerized  tailored  testing.  Paper  presented  at  the  meeting  of  the 
National  Council  on  Measurement  in  Education,  Toronto,  March  1978. 

Reckase, M. D. Computerized achievement testing using the simple logistic
model. Paper presented at the meeting of the American Educational Research
Association, New York, April 1977.

The one- and three-parameter logistic based tailored testing procedures
used in these studies were both based on maximum information item selection
and maximum likelihood ability estimation. The criteria for evaluation of
the ability estimates obtained from the procedures were the information
function and reliability coefficients obtained when the procedures were
applied in a realistic setting. Both vocabulary and achievement items were
used in the evaluation. The populations used for the studies were upper level
college students.
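
The two operations named above, maximum information item selection and maximum likelihood ability estimation, can be sketched as follows for the three-parameter logistic model. The item information formula is the standard one for that model. The scaling constant D = 1.7, the grid search used in place of an iterative solver, and all function names are assumptions of this sketch and not details of the programs used in the studies.

```python
import math

D = 1.7  # scaling constant; a common convention, assumed here


def p3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Fisher information of one three-parameter item at ability theta."""
    p = p3pl(theta, a, b, c)
    return (D * a) ** 2 * ((p - c) / (1.0 - c)) ** 2 * (1.0 - p) / p


def select_item(theta_hat, pool, used):
    """Index of the unused item with maximum information at theta_hat."""
    candidates = [i for i in range(len(pool)) if i not in used]
    return max(candidates, key=lambda i: item_information(theta_hat, *pool[i]))


def ml_theta(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum likelihood ability estimate on [lo, hi]."""
    best_theta, best_ll = lo, -math.inf
    for k in range(steps):
        theta = lo + (hi - lo) * k / (steps - 1)
        ll = 0.0
        for (a, b, c), u in zip(items, responses):
            p = p3pl(theta, a, b, c)
            ll += math.log(p) if u == 1 else math.log(1.0 - p)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta


# Hypothetical three-item pool of (a, b, c) triples.
pool = [(1.0, -2.0, 0.2), (1.0, 0.0, 0.2), (1.0, 2.0, 0.2)]
first = select_item(0.0, pool, set())
```

When every response so far is correct, or every response is incorrect, the likelihood has no interior maximum and the grid search simply returns the boundary of the grid. A tailored testing program therefore needs a separate rule for moving the ability estimate until a mixed response pattern has been observed.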

The  overall  results  obtained  from  the  series  of  studies  showed  that  a 
tailored  testing  procedure  based  on  the  three-parameter  logistic  model  gave 
both  higher  information  values  and  higher  reliability  coefficients  than  the 
one-parameter  model.  The  predictive  validity  of  the  ability  estimate  using 
scores  on  classroom  achievement  tests  as  a  criterion  was  found  to  be  about 
equal  for  the  two  models.  Twenty  item  tailored  tests  were  found  to  give 
about  equivalent  reliability  to  50  item  traditional  tests  on  the  same  material. 

An  important  finding  of  the  live  testing  research  on  tailored  testing  was 
the  determination  of  the  sensitivity  of  the  procedures  to  the  accuracy  of  the 
item  calibration  information.  When  item  parameter  estimates  were  poor,  the 
procedures  gave  meaningless  results,  regardless  of  the  quality  of  the  test 
items.  Inaccurate  parameter  estimates  also  made  the  information  function 
meaningless.  High  information  values  were  sometimes  obtained  for  tests  with 
low  reliabilities.  These  results  point  out  the  critical  importance  of  item 
calibration  and  linking. 

In  addition  to  evaluating  the  quality  of  ability  estimates  using  the  two 
procedures,  research  was  done  to  determine  the  best  way  to  operate  the  tailored 
testing  procedure.  This  research  involved  determining  the  composition  of  the 
item  pool,  the  appropriate  place  in  the  pool  to  start  administering  items,  and 
the  item  selection  procedure  to  use  before  ability  estimates  were  available. 

The results of this research are reported in the following reports and papers.

Patience, W. M. & Reckase, M. D. Effects of program parameters and item pool
characteristics on the bias of a three-parameter tailored testing
procedure. Paper presented at the meeting of the National Council on
Measurement in Education, Boston, April 1980.

Patience, W. M. & Reckase, M. D. Operational characteristics of a one-parameter
tailored testing procedure (Research Report 79-2). Columbia, MO:
University of Missouri, October 1979.

Patience, W. M. & Reckase, M. D. Operational characteristics of a Rasch model
tailored testing procedure when program parameter and item pool attributes
are varied. Paper presented at the meeting of the National Council on
Measurement in Education, San Francisco, April 1979.

The methodology used for this research was to simulate the testing
process for a hypothetical examinee of known ability and determine the mean
and standard deviation of the obtained ability estimates. The procedure
was considered acceptable if the resulting estimates were unbiased and had
a small variance. An analytic procedure that traced all possible paths
through the item pool for a person of known ability was also used to determine
the statistical bias and variance of estimates.

The results of this research indicated that the characteristics of the
item pool are important in determining the quality of the ability estimates.
The item pool should have a rectangular distribution of item difficulties,
and a uniform level of item discrimination. Items with low discrimination
parameter estimates are not selected by the tailored testing procedure so
they should not be included in determining the size of the active item pool.
It was also found to be important that the difficulty scale be uniformly
covered by items. If gaps in the coverage were present, regions of the
ability scale would be poorly estimated.

Other recommendations can be made based on this research. First, the
initial ability estimate used to start the testing session should be one
that selects a first item of about median difficulty since it is important
that enough items are present both above and below the initial ability to
give good estimation. Also, when using a maximum likelihood ability
estimation procedure, the stepsize used before an estimate is obtained should
be approximately .7 for the one-parameter procedure and .3 for the
three-parameter procedure. Otherwise, the examinee's ability estimate may move
out of the range where items are present before a good ability estimate can
be obtained.

The recommendations given above should only be considered as rough
guidelines because of the complex interaction of the variables controlling the
tailored testing situation. The best recourse is to simulate the operation
of about 50 examinees at numerous points along the ability scale using the
actual item parameters from the item pool to be used. This procedure can
be used to determine the accuracy of ability estimates that can be obtained.
The controlling parameters of the tailored testing program can then be fine
tuned to the item pool.
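
A simulation of the kind recommended above might be organized as in the following sketch for a one-parameter tailored test. The stepsize of .7 follows the guideline given earlier. The pool of 61 evenly spaced difficulties, the test length of 15 items, the grid search estimator, and all function names are hypothetical choices made for illustration, not the design of the project's simulation programs.

```python
import math
import random
import statistics


def p_1pl(theta, b):
    """One-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def ml_theta(bs, us, lo=-4.0, hi=4.0, steps=161):
    """Grid-search maximum likelihood estimate for a mixed response pattern."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]

    def loglik(t):
        return sum(math.log(p_1pl(t, b)) if u else math.log(1.0 - p_1pl(t, b))
                   for b, u in zip(bs, us))

    return max(grid, key=loglik)


def simulate_test(true_theta, pool, rng, test_length=15, start=0.0, stepsize=0.7):
    """Administer one simulated tailored test and return the final estimate."""
    theta_hat, used, bs, us = start, set(), [], []
    for _ in range(test_length):
        # For the one-parameter model the most informative unused item is the
        # one whose difficulty is closest to the current ability estimate.
        i = min((j for j in range(len(pool)) if j not in used),
                key=lambda j: abs(pool[j] - theta_hat))
        used.add(i)
        u = 1 if rng.random() < p_1pl(true_theta, pool[i]) else 0
        bs.append(pool[i])
        us.append(u)
        if 0 < sum(us) < len(us):
            theta_hat = ml_theta(bs, us)  # a mixed pattern has an ML estimate
        else:
            theta_hat += stepsize if u else -stepsize  # fixed step until then
    return theta_hat


def evaluate(pool, thetas, n_examinees=50, seed=0):
    """Return (true ability, bias, standard deviation) for each ability level."""
    rng = random.Random(seed)
    rows = []
    for t in thetas:
        estimates = [simulate_test(t, pool, rng) for _ in range(n_examinees)]
        rows.append((t, statistics.mean(estimates) - t, statistics.pstdev(estimates)))
    return rows


# Hypothetical rectangular pool of difficulties from -3.0 to 3.0.
pool = [-3.0 + 0.1 * k for k in range(61)]
```

Calling evaluate(pool, [-2.0, 0.0, 2.0]) would give one row per ability level. Rerunning it with the actual item parameters and with different starting points or stepsizes is the kind of fine tuning the paragraph above describes.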


Tailored  Testing  Procedure 

Based upon research reported up to this point, the three-parameter
logistic model is a clear choice over the one-parameter logistic model for
tailored testing applications. However, there are two commonly used
procedures for applying the three-parameter model to tailored testing, Owen's
Bayesian procedure (Owen, 1975) and the maximum likelihood procedure, and
little work has been done to directly compare the two. A live testing study
comparing these two procedures was conducted on this project to obtain
information relevant to choosing between them. The results of the study
are given in the following report and paper:

McKinley,  R.  L.  &  Reckase,  M.  D.  A  comparison  of  a  Bayesian  and  a  maximum 
likel  ihood  tailored  testing  procedure  (besearch  Report  &I-2) .  Co) umEi a , 
FTCFi  University  of  Missouri,  August ^1981. 

Rosso,  M.  A.  &  Reckase,  M.  D.  A  comparison  of  a  maximum  likelihood  and  a 
Bayesian  ability  estimation  procedure  for  tailored  testing.  Paper  pre¬ 
sented  at  the  meeting  of  the  National  Council  on  Measurement  in  Educa¬ 
tion,  Los  Angeles,  April  1981. 
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This  study  compared  the  reliability  coefficients,  information  functions 
and  ability  estimates  fo~  achievement  tests  administered  using  either  Owen's 
Bayesian  procedure  or  a  maximum  likelihood  tailored  testing  procedure.  The 
Bayesian  procedure  selected  items  to  minimize  the  posterior  variance  of  the 
ability  estimates  and  estimated  ability  as  the  mean  of  the  posterior  distri¬ 
bution  of  ability.  The  maximum  likelihood  procedure  selected  items  to  maxi¬ 
mize  the  information  function  at  the  most  recent  ability  estimate  and  esti¬ 
mated  ability  using  an  empirical  maximum  likelihood  approach. 
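
The maximum information selection rule described above can be sketched as
follows. This is a minimal illustration, not the program used in the study:
the scaling constant D = 1.7, the function names, and the four-item pool are
assumptions introduced here, and the information formula is the standard one
for the three-parameter logistic model.

```python
import math

D = 1.7  # scaling constant often used with the logistic model (assumption)

def p3pl(theta, a, b, c):
    """Probability of a correct response under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def item_information(theta, a, b, c):
    """Information provided by one three-parameter item at ability theta."""
    p = p3pl(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2

def select_item(theta_hat, pool, used):
    """Index of the unused item with maximum information at theta_hat."""
    candidates = [i for i in range(len(pool)) if i not in used]
    return max(candidates, key=lambda i: item_information(theta_hat, *pool[i]))

# Hypothetical item pool; each entry is (a, b, c) =
# (discrimination, difficulty, lower asymptote).
pool = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2), (1.5, 0.2, 0.2)]

first = select_item(0.0, pool, used=set())     # item chosen at the starting estimate
second = select_item(0.0, pool, used={first})  # next best item at the same estimate
```

In an operating procedure the ability estimate would be updated after each
response and the selection repeated at the new estimate.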

The  results  of  the  study  showed  that  the  two  procedures  had  approximately 
equal  reliabilities  and  information  functions.  However,  a  definite  regres¬ 
sion  effect  was  found  to  be  a  result  of  the  Bayesian  prior.  In  this  study, 
the  prior  was  assumed  to  be  normal  with  a  mean  near  the  median  difficulty  of 
the  item  pool  and  a  variance  of  1.0.  Since  the  prior  mean  was  somewhat  lower 
than  the  ability  of  the  group  tested,  the  ability  estimates  were  artificially 
kept  lower  than  the  maximum  likelihood  estimates.  Because  of  this  effect, 
the  maximum  likelihood  procedure  was  recommended  if  accurate  prior  informa¬ 
tion  were  not  available. 
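
The regression effect can be illustrated with a small numerical sketch. The
code below does not implement Owen's procedure; it substitutes a simple
grid-based posterior mean under a normal prior, which is enough to show how a
prior mean below the examinee's ability pulls the estimate down relative to
the maximum likelihood estimate. The items, the responses, the grid, and the
constant D = 1.7 are all hypothetical.

```python
import math

D = 1.7  # scaling constant (assumption)

def p3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log likelihood of a 0/1 response string at ability theta."""
    total = 0.0
    for (a, b, c), u in zip(items, responses):
        p = p3pl(theta, a, b, c)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

# Ability grid from -4.00 to 4.00 in steps of 0.01.
GRID = [i / 100.0 for i in range(-400, 401)]

def ml_estimate(items, responses):
    """Grid point that maximizes the likelihood."""
    return max(GRID, key=lambda t: log_likelihood(t, items, responses))

def posterior_mean(items, responses, prior_mean=0.0, prior_var=1.0):
    """Mean of the posterior on the grid, using a normal prior."""
    weights = [
        math.exp(log_likelihood(t, items, responses)
                 - (t - prior_mean) ** 2 / (2.0 * prior_var))
        for t in GRID
    ]
    return sum(t * w for t, w in zip(GRID, weights)) / sum(weights)

# Hypothetical able examinee: five correct responses, one incorrect.
items = [(1.0, b, 0.2) for b in (-1.0, -0.5, 0.0, 0.5, 1.0, 1.5)]
responses = [1, 1, 1, 1, 1, 0]

mle = ml_estimate(items, responses)
bayes_low_prior = posterior_mean(items, responses, prior_mean=0.0)
bayes_high_prior = posterior_mean(items, responses, prior_mean=1.5)
```

With the prior centered at 0.0, the posterior mean falls below the maximum
likelihood estimate; moving the prior mean closer to the examinee's ability
reduces the gap.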

Decision  Making  Procedure 

The  final  project  undertaken  on  this  contract  was  to  investigate  deci¬ 
sion  making  procedures  for  use  with  tailored  testing.  The  most  convenient 
procedure found for use with tailored testing was based on Wald's (1947)
sequential  probability  ratio  test.  Research  was  done  on  the  contract  to 
determine  the  usefulness  of  such  a  procedure.  The  results  of  the  research 
effort  are  presented  in  the  following  reports  and  papers. 

Reckase,  M.  D.  The  use  of  the  sequential  probability  ratio  test  in  making 
grade classifications in conjunction with tailored testing (Research
Report 81-4). Columbia, MO: University of Missouri, August 1981.

Reckase, M. D. Some decision procedures for use with tailored testing. In
D. J. Weiss (Ed.), Proceedings of the 1979 computerized adaptive testing
conference. Minneapolis: University of Minnesota, 1980.

Reckase,  M.  D.  An  application  of  tailored  testing  and  sequential  analysis 
to  classification  problems.  Paper  presented  at  the  meeting  of  the 
American  Educational  Research  Association,  Boston,  April  1980. 

Reckase,  M.  D.  A  generalization  of  sequential  analysis  to  decision  making 
with  tailored  testing.  Paper  presented  at  the  meeting  of  the  Military 
Testing Association, Oklahoma City, OK, November 1978.


The  SPRT  procedure  was  investigated  for  use  with  tailored  testing  using 
both  simulation  and  live  testing  techniques.  Use  of  the  SPRT  with  both  the 
one-parameter  and  three-parameter  models  was  studied.  The  results  showed  that 
a  substantial  reduction  of  test  length  could  be  attained  through  the  use  of 
the SPRT without loss of decision accuracy. As with previous work, the
three-parameter procedure was found to yield better results than the
one-parameter procedure. The detrimental effects of guessing on the operation
of the one-parameter logistic based tailored testing procedure were an
especially important factor in the results.
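
A minimal sketch of how an SPRT can end a test early is given below. It
assumes the common formulation in which two ability values, one below and one
above the cutting score, define the hypotheses, and in which Wald's
approximate bounds are computed from the nominal error rates. The function
name, the two ability values, the error rates, and the items are hypothetical
and are not taken from the reports cited above.

```python
import math

D = 1.7  # scaling constant (assumption)

def p3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def sprt_decision(items, responses, theta_low, theta_high,
                  alpha=0.05, beta=0.05):
    """Apply Wald's SPRT to a sequence of scored item responses.

    Returns (decision, number_of_items_used), where decision is
    "pass", "fail", or "undecided" if the items run out first.
    """
    upper = math.log((1.0 - beta) / alpha)   # accept the higher ability
    lower = math.log(beta / (1.0 - alpha))   # accept the lower ability
    log_ratio = 0.0
    for n, ((a, b, c), u) in enumerate(zip(items, responses), start=1):
        p_high = p3pl(theta_high, a, b, c)
        p_low = p3pl(theta_low, a, b, c)
        if u == 1:
            log_ratio += math.log(p_high / p_low)
        else:
            log_ratio += math.log((1.0 - p_high) / (1.0 - p_low))
        if log_ratio >= upper:
            return "pass", n
        if log_ratio <= lower:
            return "fail", n
    return "undecided", len(items)

# Hypothetical 20-item test with identical items near the cutting score.
items = [(1.5, 0.0, 0.2)] * 20
all_correct = sprt_decision(items, [1] * 20, theta_low=-0.5, theta_high=0.5)
all_wrong = sprt_decision(items, [0] * 20, theta_low=-0.5, theta_high=0.5)
```

In this example a consistent response pattern reaches a decision after only a
few of the 20 available items, which is the kind of reduction in test length
described above.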


Summary  and  Conclusion 

This  research  project  studied  many  facets  of  the  application  of  IRT 
and  tailored  testing  to  achievement  measurement.  Included  were  studies  of 
techniques for sorting items into unidimensional item sets, calibrating
test items, linking item calibrations, estimating ability, selecting items
for tailored testing, and making decisions using tailored testing. Overall,
the  results  were  fairly  positive.  Unidimensional  item  sets  can  be  formed 
using  the  principal  factor  technique  on  phi  coefficients  and  the  items  can 
be  calibrated  for  use  with  tailored  testing  using  the  LOGIST  program  if  a 
sufficient  sample  of  individuals  is  available.  The  calibration  of  separate 
tests  can  be  linked  to  produce  a  large  item  pool  if  at  least  15  items  are 
in  common  between  the  tests  and  a  sample  of  performance  for  at  least  2000 
individuals  is  available  for  each  test.  The  maximum  likelihood  procedure 
using  the  LOGIST  program  is  recommended  for  the  linking.  The  three-para¬ 
meter  logistic  model  has  been  shown  to  give  an  adequate  theoretical  basis 
for  tailored  testing,  even  for  achievement  testing  which  does  not  quite 
meet  the  assumptions  of  the  model.  A  tailored  testing  model  based  on  maxi¬ 
mum  information  item  selection  and  maximum  likelihood  ability  estimation 
is  recommended  for  use,  with  an  expected  result  of  reducing  the  number  of 
items  required  to  obtain  a  test  reliability  equal  to  that  of  a  traditional 
test  more  than  twice  as  long.  Accurate  decision  making  has  been  shown  to 
be  possible  using  the  sequential  probability  ratio  test  with  tailored  testing 
with  a  substantial  reduction  in  the  number  of  test  items  administered  and 
tight  control  of  errors  of  classification. 

With all of these positive results, there are still many areas in which
the  user  of  tailored  testing  must  exercise  extreme  caution.  Unlike  tradi¬ 
tional  paper  and  pencil  testing,  tailored  testing  is  critically  dependent 
on the quality of the calibration and linking of the item pool. If the item
parameters  are  poorly  estimated,  the  item  selection  procedure  and  ability 
estimation  procedure  will  be  operating  on  meaningless  numbers  and  will  tend 
to  give  meaningless  results.  The  situation  is  equivalent  to  determining 
the  length  of  a  line  with  a  ruler  that  has  its  units  marked  off  in  the  wrong 
places.  Trusting  the  calibration  too  much  can  give  test  results  that  look 
good  (i.e.,  have  high  information  functions),  when  the  reliability  of  the 
scores  is  in  fact  very  low.  As  a  result,  tailored  tests  should  still  be 
evaluated  for  quality  using  procedures  independent  of  the  item  parameters, 
such  as  test-retest  reliability. 

The  use  of  the  three-parameter  logistic  model  as  a  basis  for  ability 
estimation  causes  some  subtle  problems  in  test  score  interpretation.  Using 
this  model  causes  item  responses  to  be  weighted  by  the  discrimination  para¬ 
meter  estimates  when  computing  an  estimate  of  ability.  This  gives  high  weight 
to  those  items  measuring  the  major  component  of  a  test  and  very  low  weight 
to  items  not  measuring  that  component.  The  result  is  an  ability  estimate 
measuring  a  trait  that  is  more  unidimensional  than  is  obtained  from  a  number 
correct  score.  This  difference  must  be  taken  into  account  when  making  use 
of  tailored  testing. 
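
The weighting can be seen from the maximum likelihood equation for ability.
The notation below is the standard one for the three-parameter logistic model
and is an assumption here, since this report does not state the equations:
u_i is the scored response to item i, a_i, b_i, and c_i are the item
parameters, and D is a scaling constant.

```latex
P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-D a_i (\theta - b_i)}}

\frac{\partial \ln L}{\partial \theta}
  = \sum_{i=1}^{n} D\, a_i\,
    \frac{P_i(\theta) - c_i}{(1 - c_i)\, P_i(\theta)}\,
    \bigl(u_i - P_i(\theta)\bigr) = 0
```

Each residual u_i - P_i(θ) enters the sum multiplied by a_i, so an item with a
small discrimination estimate contributes little to the ability estimate. When
c_i = 0 the weight reduces to D a_i. A number correct score, in contrast,
weights every item equally.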

The use of tailored testing for measurement is similar in many ways
to  the  use  of  computers  for  computation.  The  techniques  give  high  power 
and  efficiency  based  on  high  technology.  But  a  price  is  paid  for  the  ad¬ 
vantages.  The  price  is  a  greater  sensitivity  to  the  input  to  the  proce¬ 
dure  and  a  greater  dependence  on  complicated  hardware.  The  results  of 
this  research  contract  definitely  show  that  tailored  testing  can  be  applied 
to  achievement  measurement  with  many  advantages.  However,  the  technique 
cannot  be  applied  carelessly  and  still  achieve  those  advantages. 